For this question, we are given code that generates a sine curve with some noise. Let's plot the data below and store it in a data frame for future use. I have also displayed the first six data points.
## x
## 1 2.000000
## 2 2.080808
## 3 2.161616
## 4 2.242424
## 5 2.323232
## 6 2.404040
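The noisy sine data can be generated with a sketch along these lines (the seed value, noise level, and object names `sinData` and `dataGrid` are assumptions; the evenly spaced grid matches the values printed above):

```r
set.seed(100)
# noisy sine curve: 100 random x values on [2, 10]
x <- runif(100, min = 2, max = 10)
y <- sin(x) + rnorm(length(x)) * 0.25
sinData <- data.frame(x = x, y = y)

# evenly spaced grid used for plotting model predictions;
# head(dataGrid) gives the six rows shown above
dataGrid <- data.frame(x = seq(2, 10, length = 100))

plot(sinData$x, sinData$y, xlab = "x", ylab = "y")
```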
As per the instructions, I am using set.seed(1) so that my code is reproducible. The following table shows the parameter values I used to fit the different models with radial basis functions.
## costs epsilons
## 1 0.25 0.01
## 2 1.00 0.01
## 3 4.00 0.01
## 4 256.00 0.01
## 5 0.25 0.05
## 6 1.00 0.05
## 7 4.00 0.05
## 8 256.00 0.05
## 9 0.25 0.10
## 10 1.00 0.10
## 11 4.00 0.10
## 12 256.00 0.10
## 13 0.25 0.50
## 14 1.00 0.50
## 15 4.00 0.50
## 16 256.00 0.50
Above we have the plots for the various cost and epsilon parameter values fitted over the given curve.
Now let's add the sigma parameter as a hyperparameter. Each line shows a different sigma value in a different color.
From the plot, we can see that increasing the cost increases the model complexity. Thus a model with low cost will have high bias, while a model with high cost will have high variance.
From the last column of the above plot, we can see that increasing the epsilon value decreases the model complexity. The plot in the upper right corner has a cost of 256 and an epsilon of 0.91; the fit with a high sigma value is shown in red. It appears to overfit the data, while the plot at the bottom of the same column shows a less complex model than the one at the top. We can say that increasing epsilon makes the model smoother.
Increasing the sigma value overfits the data. We can clearly see that the green line, with a low sigma value, gives a very simple fit (high bias) compared to the red line, which gives a very complex fit (high variance).
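The radial basis fits discussed above can be reproduced with kernlab's ksvm(); this sketch assumes the noisy data live in a data frame `sinData` with columns `x` and `y`, and the fixed sigma shown in the comment is an illustrative assumption:

```r
library(kernlab)

# the cost/epsilon grid tabulated above
params <- expand.grid(costs = c(0.25, 1, 4, 256),
                      epsilons = c(0.01, 0.05, 0.10, 0.50))

set.seed(1)
fits <- lapply(seq_len(nrow(params)), function(i) {
  ksvm(y ~ x, data = sinData, kernel = "rbfdot",
       C = params$costs[i], epsilon = params$epsilons[i])
})

# overlay the first fit on the scatter plot
dataGrid <- data.frame(x = seq(2, 10, length = 100))
plot(sinData$x, sinData$y)
lines(dataGrid$x, predict(fits[[1]], dataGrid), col = "red")

# a fixed sigma can be supplied instead of the automatic estimate:
# ksvm(y ~ x, data = sinData, kernel = "rbfdot",
#      kpar = list(sigma = 1), C = 256, epsilon = 0.1)
```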
For this question we are going to work on Friedman's dataset, created by simulation. The mlbench package in R provides this data. I am going to load the training and test sets as directed in the book.
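The loading step can be sketched as follows, mirroring the book's code for this exercise (object names as in the book):

```r
library(mlbench)
library(caret)

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
# convert the predictor matrix to a data frame with column names
trainingData$x <- data.frame(trainingData$x)

# a large test set to estimate the true error rate
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

# scatter plots of each predictor against the outcome
featurePlot(trainingData$x, trainingData$y)
```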
The above plot shows that we have successfully loaded the training and test sets.
As given in the book, let's first fit the data with a KNN model. For this I have used train() in the caret package with knn as the method. Below we have the complete information about the KNN model.
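A call along these lines produces the summary that follows (assuming the simulated data are stored in `trainingData` and `testData`, as in the book's exercise):

```r
library(caret)

set.seed(1)
knnModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel

# test-set performance (RMSE, R^2, MAE)
postResample(predict(knnModel, newdata = testData$x), testData$y)
```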
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.626071 0.4766532 2.954620
## 7 3.463866 0.5231142 2.823065
## 9 3.387872 0.5552706 2.744969
## 11 3.345212 0.5754355 2.706971
## 13 3.317962 0.5964912 2.696927
## 15 3.291614 0.6123880 2.668077
## 17 3.293991 0.6238074 2.665458
## 19 3.300870 0.6342793 2.668902
## 21 3.297502 0.6472145 2.674792
## 23 3.298045 0.6534698 2.676965
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
From the above plot and the tabulated results, we can clearly see that the best model has a k value of 15; that is, the best R^2 is obtained when we consider 15 nearest neighbours. The R^2 value obtained on the test set is 0.6786.
Let's build a neural network model using train() in the caret package with nnet as the method. I am setting the decay to 0, 0.01, 0.1 and the size varying from 1 to 10.
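The tuning grid and fit can be sketched as below (object names are assumptions; `linout = TRUE` is needed so nnet produces a linear output for regression):

```r
library(caret)

# grid of decay and size values described above
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1), size = 1:10)

set.seed(1)
nnetModel <- train(x = trainingData$x, y = trainingData$y,
                   method = "nnet",
                   preProc = c("center", "scale"),
                   tuneGrid = nnetGrid,
                   linout = TRUE,   # linear output for regression
                   trace = FALSE,
                   maxit = 500)
```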
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.680325 0.7141610 2.098819
## 0.00 2 2.839397 0.6861113 2.235487
## 0.00 3 2.977969 0.6701142 2.341324
## 0.00 4 3.289448 0.6282245 2.501787
## 0.00 5 4.953705 0.4986515 3.165885
## 0.00 6 5.434198 0.4491815 3.316184
## 0.00 7 8.726366 0.2937632 4.601882
## 0.00 8 10.163375 0.3888125 5.030177
## 0.00 9 4.265547 0.5062168 3.216742
## 0.00 10 4.393696 0.4964493 3.338989
## 0.01 1 2.502881 0.7525307 1.935655
## 0.01 2 2.804639 0.6970563 2.179830
## 0.01 3 2.747941 0.7113945 2.159689
## 0.01 4 3.065408 0.6636576 2.394201
## 0.01 5 3.239405 0.6349742 2.538847
## 0.01 6 3.618123 0.5746367 2.821578
## 0.01 7 3.935655 0.5223489 3.084171
## 0.01 8 3.831064 0.5389188 3.021469
## 0.01 9 3.650763 0.5720477 2.911138
## 0.01 10 3.838880 0.5328379 3.062723
## 0.10 1 2.505958 0.7513321 1.934889
## 0.10 2 2.790805 0.6999236 2.190519
## 0.10 3 2.749609 0.7158989 2.159751
## 0.10 4 2.825735 0.7007805 2.221362
## 0.10 5 2.965525 0.6799272 2.336499
## 0.10 6 3.354690 0.6172005 2.632584
## 0.10 7 3.335323 0.6168686 2.654745
## 0.10 8 3.712228 0.5392484 2.970709
## 0.10 9 3.432621 0.6031285 2.696727
## 0.10 10 3.321041 0.6249708 2.670665
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.01.
## RMSE Rsquared MAE
## 2.6433262 0.7194364 2.0232660
From the above plot and the tabulated results, we can clearly see that the best neural network model has size 1 and decay 0.01. The R^2 value obtained on the test set is 0.7194.
Let's build a model-averaged neural network using train() in the caret package with avNNet as the method. I am setting the decay to 0, 0.01, 0.1 and the size varying from 1 to 10, with bag set to FALSE.
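This fit can be sketched as below (object names are assumptions; the decay values are the ones shown in the resampling summary that follows):

```r
library(caret)

# for avNNet, bag is part of the tuning grid and held at FALSE
avNNetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
                          size = 1:10,
                          bag = FALSE)

set.seed(1)
avNNetModel <- train(x = trainingData$x, y = trainingData$y,
                     method = "avNNet",
                     preProc = c("center", "scale"),
                     tuneGrid = avNNetGrid,
                     linout = TRUE, trace = FALSE, maxit = 500)
```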
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.546337 0.7439541 1.968793
## 0.00 2 2.581790 0.7400278 2.012679
## 0.00 3 2.535763 0.7515368 1.920367
## 0.00 4 2.922345 0.6958074 2.144579
## 0.00 5 3.795250 0.5939801 2.576917
## 0.00 6 4.841328 0.4831421 3.176902
## 0.00 7 4.827818 0.4718842 3.234973
## 0.00 8 4.805434 0.4642360 3.123138
## 0.00 9 4.019746 0.5945787 2.592198
## 0.00 10 3.257907 0.6450295 2.353564
## 0.01 1 2.500329 0.7521578 1.928185
## 0.01 2 2.569620 0.7404300 2.006646
## 0.01 3 2.339930 0.7852714 1.832702
## 0.01 4 2.424232 0.7737201 1.917992
## 0.01 5 2.510170 0.7522292 1.974175
## 0.01 6 2.699310 0.7275242 2.123788
## 0.01 7 2.819939 0.7043841 2.212574
## 0.01 8 2.673350 0.7251305 2.102813
## 0.01 9 2.647719 0.7301051 2.064986
## 0.01 10 2.637050 0.7280176 2.086017
## 0.10 1 2.505918 0.7513368 1.934844
## 0.10 2 2.563253 0.7429928 1.989915
## 0.10 3 2.303527 0.7889688 1.814756
## 0.10 4 2.334351 0.7835149 1.842384
## 0.10 5 2.426563 0.7702325 1.913187
## 0.10 6 2.556781 0.7469632 2.015201
## 0.10 7 2.541515 0.7480649 2.008697
## 0.10 8 2.520092 0.7524492 1.993615
## 0.10 9 2.442685 0.7664654 1.927898
## 0.10 10 2.445824 0.7672685 1.927857
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3, decay = 0.1 and bag
## = FALSE.
## RMSE Rsquared MAE
## 2.050326 0.833895 1.546593
From the above plot and the tabulated results, we can clearly see that the best averaged neural network model has a weight decay of 0.1 and size 3. The R^2 value obtained on the test set is 0.8339.
Let's build a MARS model using train() in the caret package with earth as the method and no preprocessing. I am setting the degree to 1, 2, 3 and nprune varying from 2 to 38.
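A sketch of the MARS fit (object names are assumptions; the grid matches the degree and nprune ranges described above):

```r
library(caret)

marsGrid <- expand.grid(degree = 1:3, nprune = 2:38)

set.seed(1)
marsModel <- train(x = trainingData$x, y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid)
```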
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.510221 0.2017622 3.719596
## 1 3 3.696575 0.4621782 2.992075
## 1 4 2.776976 0.6965425 2.226214
## 1 5 2.535576 0.7442072 2.021430
## 1 6 2.396059 0.7748952 1.910341
## 1 7 1.997451 0.8427290 1.588618
## 1 8 1.878071 0.8607824 1.486225
## 1 9 1.806903 0.8709898 1.408215
## 1 10 1.794167 0.8728956 1.387328
## 1 11 1.796483 0.8729617 1.390498
## 1 12 1.810352 0.8714632 1.403777
## 1 13 1.808039 0.8718113 1.395422
## 1 14 1.818921 0.8702901 1.405284
## 1 15 1.828042 0.8690235 1.416676
## 1 16 1.845366 0.8672717 1.429085
## 1 17 1.845366 0.8672717 1.429085
## 1 18 1.845366 0.8672717 1.429085
## 1 19 1.845366 0.8672717 1.429085
## 1 20 1.845366 0.8672717 1.429085
## 1 21 1.845366 0.8672717 1.429085
## 1 22 1.845366 0.8672717 1.429085
## 1 23 1.845366 0.8672717 1.429085
## 1 24 1.845366 0.8672717 1.429085
## 1 25 1.845366 0.8672717 1.429085
## 1 26 1.845366 0.8672717 1.429085
## 1 27 1.845366 0.8672717 1.429085
## 1 28 1.845366 0.8672717 1.429085
## 1 29 1.845366 0.8672717 1.429085
## 1 30 1.845366 0.8672717 1.429085
## 1 31 1.845366 0.8672717 1.429085
## 1 32 1.845366 0.8672717 1.429085
## 1 33 1.845366 0.8672717 1.429085
## 1 34 1.845366 0.8672717 1.429085
## 1 35 1.845366 0.8672717 1.429085
## 1 36 1.845366 0.8672717 1.429085
## 1 37 1.845366 0.8672717 1.429085
## 1 38 1.845366 0.8672717 1.429085
## 2 2 4.510138 0.2017189 3.719805
## 2 3 3.705720 0.4595804 2.997697
## 2 4 2.761341 0.7004236 2.205055
## 2 5 2.527292 0.7476576 2.003029
## 2 6 2.435956 0.7666567 1.918751
## 2 7 2.002104 0.8394419 1.592260
## 2 8 1.883270 0.8591193 1.467498
## 2 9 1.737783 0.8800273 1.352830
## 2 10 1.611688 0.8960040 1.263020
## 2 11 1.498473 0.9099346 1.183164
## 2 12 1.458763 0.9144646 1.162416
## 2 13 1.459539 0.9159286 1.146368
## 2 14 1.480467 0.9140997 1.159628
## 2 15 1.499974 0.9116557 1.173345
## 2 16 1.520221 0.9099219 1.183261
## 2 17 1.530444 0.9085721 1.191702
## 2 18 1.530076 0.9085017 1.186946
## 2 19 1.532100 0.9081181 1.187963
## 2 20 1.532100 0.9081181 1.187963
## 2 21 1.532100 0.9081181 1.187963
## 2 22 1.532100 0.9081181 1.187963
## 2 23 1.532100 0.9081181 1.187963
## 2 24 1.532100 0.9081181 1.187963
## 2 25 1.532100 0.9081181 1.187963
## 2 26 1.532100 0.9081181 1.187963
## 2 27 1.532100 0.9081181 1.187963
## 2 28 1.532100 0.9081181 1.187963
## 2 29 1.532100 0.9081181 1.187963
## 2 30 1.532100 0.9081181 1.187963
## 2 31 1.532100 0.9081181 1.187963
## 2 32 1.532100 0.9081181 1.187963
## 2 33 1.532100 0.9081181 1.187963
## 2 34 1.532100 0.9081181 1.187963
## 2 35 1.532100 0.9081181 1.187963
## 2 36 1.532100 0.9081181 1.187963
## 2 37 1.532100 0.9081181 1.187963
## 2 38 1.532100 0.9081181 1.187963
## 3 2 4.510138 0.2017189 3.719805
## 3 3 3.705720 0.4595804 2.997697
## 3 4 2.761341 0.7004236 2.205055
## 3 5 2.527292 0.7476576 2.003029
## 3 6 2.435956 0.7666567 1.918751
## 3 7 2.002104 0.8394419 1.592260
## 3 8 1.883270 0.8591193 1.467498
## 3 9 1.737783 0.8800273 1.352830
## 3 10 1.616709 0.8953253 1.269326
## 3 11 1.502878 0.9095382 1.185268
## 3 12 1.464964 0.9137778 1.164108
## 3 13 1.473203 0.9147496 1.154037
## 3 14 1.477041 0.9139106 1.159154
## 3 15 1.504577 0.9112614 1.177719
## 3 16 1.514711 0.9102665 1.178596
## 3 17 1.520085 0.9091042 1.182633
## 3 18 1.525661 0.9080718 1.187705
## 3 19 1.524936 0.9081347 1.186952
## 3 20 1.523364 0.9082714 1.185232
## 3 21 1.523364 0.9082714 1.185232
## 3 22 1.523364 0.9082714 1.185232
## 3 23 1.523364 0.9082714 1.185232
## 3 24 1.523364 0.9082714 1.185232
## 3 25 1.523364 0.9082714 1.185232
## 3 26 1.523364 0.9082714 1.185232
## 3 27 1.523364 0.9082714 1.185232
## 3 28 1.523364 0.9082714 1.185232
## 3 29 1.523364 0.9082714 1.185232
## 3 30 1.523364 0.9082714 1.185232
## 3 31 1.523364 0.9082714 1.185232
## 3 32 1.523364 0.9082714 1.185232
## 3 33 1.523364 0.9082714 1.185232
## 3 34 1.523364 0.9082714 1.185232
## 3 35 1.523364 0.9082714 1.185232
## 3 36 1.523364 0.9082714 1.185232
## 3 37 1.523364 0.9082714 1.185232
## 3 38 1.523364 0.9082714 1.185232
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
From the above plot and the tabulated results, we can clearly see that the best MARS model has nprune = 12 and degree = 2. The R^2 value obtained on the test set is 0.9335.
## RMSE Rsquared MAE
## 1.6543157 0.8890979 1.3021739
From the above plot and the tabulated results, the best MARS model with spatial sign preprocessing and correlated predictors removed also has nprune = 12 and degree = 2. The R^2 value obtained on the test set is 0.8891.
Compared with the MARS model with no preprocessing (test R^2 of 0.9335), this is quite low. Thus the MARS model without the spatial sign transform performs better.
Let's build a support vector machine using train() in the caret package with the svmRadial method, which uses a radial basis function kernel. I set tuneLength to 14, so the default grid search covers 14 cost values 2^-2, 2^-1, . . . , 2^11; sigma is estimated analytically by default.
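A sketch of the SVM fit (object names are assumptions):

```r
library(caret)

set.seed(1)
svmModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14)  # C = 2^-2 ... 2^11; sigma estimated analytically
```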
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 4.978256 0.1218695 4.085782
## 0.50 4.948892 0.1339104 4.059287
## 1.00 4.924717 0.1424149 4.040261
## 2.00 4.913855 0.1525388 4.031108
## 4.00 4.914246 0.1529141 4.030790
## 8.00 4.914246 0.1529141 4.030790
## 16.00 4.914246 0.1529141 4.030790
## 32.00 4.914246 0.1529141 4.030790
## 64.00 4.914246 0.1529141 4.030790
## 128.00 4.914246 0.1529141 4.030790
## 256.00 4.914246 0.1529141 4.030790
## 512.00 4.914246 0.1529141 4.030790
## 1024.00 4.914246 0.1529141 4.030790
## 2048.00 4.914246 0.1529141 4.030790
##
## Tuning parameter 'sigma' was held constant at a value of 0.06444911
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06444911 and C = 2.
## RMSE Rsquared MAE
## 4.8632139 0.2051648 3.9592115
From the above plot and the tabulated result, we can clearly see that the final values used for the model were sigma = 0.06444911 and C = 2. The R^2 value obtained on the test set is 0.2052.
Thus, from building all the above models, I have the following table.
| Model | Parameter | Training RMSE | Training R Squared | Testing RMSE | Testing R Squared |
|---|---|---|---|---|---|
| KNN | k = 15 | 3.2916 | 0.6124 | 3.1751 | 0.6786 |
| Neural Network | size = 1 & decay = 0.01 | 2.5028 | 0.7526 | 2.6433 | 0.7194 |
| Averaged Neural Network | size = 3 & decay = 0.1 | 2.3035 | 0.7890 | 2.0503 | 0.8339 |
| MARS with no preprocessing | nprune = 12 & degree = 2 | 1.4588 | 0.9145 | 1.2803 | 0.9335 |
| MARS with spatial sign and correlated predictors removed | nprune = 12 & degree = 2 | 1.7686 | 0.8751 | 1.6543 | 0.8891 |
| Support Vector Machine | C = 2 & sigma = 0.06445 | 4.9139 | 0.1525 | 4.8632 | 0.2052 |
So from the table we can see that the MARS model with no preprocessing gives the best predictions, as it has the highest R^2 value (0.9335) on the test set. To answer the question about predictor importance in the MARS model, I used varImp() and plotted the result below.
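The importance plot can be produced as below (assuming the tuned MARS fit is stored in `marsModel`):

```r
library(caret)

# importance of each predictor in the tuned MARS model
marsImp <- varImp(marsModel)
plot(marsImp, top = 10)
```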
From the above plot we can clearly see that the MARS model selects only the informative predictors, x1 through x5 (in the order x2, x4, x1, x5, x3), while x6 through x10 have importance values of 0.
For this question we are going to work with the Tecator dataset and build various non-linear regression models on it. In the Tecator dataset, absorp contains the 100 absorbance values for each of the 215 samples, and endpoints contains the percent of moisture, fat, and protein in columns 1-3, respectively. Let's first load the data.
## V1 V2 V3 V4 V5
## 1 2.61776 2.61814 2.61859 2.61912 2.61981
## 2 2.83454 2.83871 2.84283 2.84705 2.85138
## 3 2.58284 2.58458 2.58629 2.58808 2.58996
## 4 2.82286 2.82460 2.82630 2.82814 2.83001
## 5 2.78813 2.78989 2.79167 2.79350 2.79538
## 6 3.00993 3.01540 3.02086 3.02634 3.03190
## mositure fat protein
## 1 60.5 22.5 16.7
## 2 46.0 40.1 13.5
## 3 71.0 8.4 20.5
## 4 72.8 5.9 20.7
## 5 58.3 25.5 15.5
## 6 44.0 42.7 13.7
Thus we have the data loaded properly above. Let's get started with building the models.
Before we fit the models, we need to split the data into training and testing samples. Given the sample size, we will retain 80% of the samples for the training set and 20% for the testing set. The training set will be used to tune the models via leave-group-out cross-validation (repeated training/test splits) in order to get a better estimate of model performance.
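The split and resampling setup can be sketched as below (the choice of endpoint column and the object names are assumptions; the LGOCV settings match the "25 reps, 75%" resampling summaries shown later):

```r
library(caret)
data(tecator)   # provides absorp and endpoints

# response: one endpoint column (fat, column 2, is an assumption)
y <- endpoints[, 2]

set.seed(1)
inTrain <- createDataPartition(y, p = 0.8, list = FALSE)
trainX <- data.frame(absorp)[inTrain, ]
trainY <- y[inTrain]
testX  <- data.frame(absorp)[-inTrain, ]
testY  <- y[-inTrain]

# leave-group-out CV: 25 repeated 75/25 train/test splits
ctrl <- trainControl(method = "LGOCV", p = 0.75, number = 25)
```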
Let's first fit the data with a KNN model. For this I have used train() in the caret package with knn as the method. Below we have the complete information about the KNN model.
## k-Nearest Neighbors
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 174, 174, 174, 174, 174, 174, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 9.907561 0.4487340 7.395216
## 7 9.914853 0.4416734 7.620026
## 9 10.007944 0.4310662 7.796599
## 11 10.143735 0.4216959 8.032671
## 13 10.315272 0.4055667 8.253291
## 15 10.509426 0.3858847 8.452882
## 17 10.706905 0.3648644 8.630340
## 19 10.858457 0.3491532 8.756594
## 21 11.000344 0.3349530 8.893459
## 23 11.108659 0.3231675 8.991943
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
## RMSE Rsquared MAE
## 8.3321816 0.4499784 6.3130081
From the above plot and the tabulated results, we can clearly see that the best model has a k value of 5; that is, the best R^2 is obtained when we consider 5 nearest neighbours. The R^2 value obtained on the test set is 0.45.
Let's build a neural network model using train() in the caret package with nnet as the method. Here we are building it without PCA. I am setting the decay to 0, 0.01, 0.1 and the size varying from 1 to 10.
## Neural Network
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 174, 174, 174, 174, 174, 174, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 4.0274029 0.8707140 2.7064810
## 0.00 2 3.3918999 0.9149002 2.1840546
## 0.00 3 2.2536621 0.9675529 1.3435654
## 0.00 4 2.3134991 0.9622970 1.3814594
## 0.00 5 2.4193988 0.9565166 1.2195736
## 0.00 6 2.6292965 0.9477805 1.2416843
## 0.00 7 2.0429444 0.9708908 1.1736787
## 0.00 8 2.1858990 0.9686163 1.2100587
## 0.00 9 5.0655687 0.8340335 2.2747451
## 0.00 10 4.1515098 0.8992701 1.7147063
## 0.01 1 1.7023423 0.9835957 1.2973041
## 0.01 2 1.2300328 0.9908059 0.8871585
## 0.01 3 1.0373540 0.9926728 0.7341144
## 0.01 4 0.8409236 0.9953549 0.5964121
## 0.01 5 0.9487652 0.9932432 0.6528657
## 0.01 6 0.9333352 0.9942958 0.6419682
## 0.01 7 0.9855168 0.9932686 0.7005875
## 0.01 8 1.0702981 0.9933602 0.7114297
## 0.01 9 1.0104963 0.9941262 0.7101267
## 0.01 10 1.1082070 0.9927349 0.7680613
## 0.10 1 2.1262044 0.9743860 1.6912023
## 0.10 2 1.1124714 0.9917564 0.8465395
## 0.10 3 1.0092120 0.9929286 0.7631179
## 0.10 4 0.9308673 0.9945179 0.7067223
## 0.10 5 0.9502652 0.9945364 0.7031983
## 0.10 6 1.0846410 0.9914797 0.8022170
## 0.10 7 1.0479175 0.9935813 0.7661535
## 0.10 8 1.0539272 0.9935898 0.7751945
## 0.10 9 1.1198449 0.9924372 0.7989222
## 0.10 10 1.1904466 0.9916675 0.8556912
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4 and decay = 0.01.
## RMSE Rsquared MAE
## 0.4271682 0.9987065 0.3236428
From the above plot and the tabulated results, we can clearly see that the best neural network model has size 4 and decay 0.01. The R^2 value obtained on the test set is 0.9987.
Let us build the model with PCA added to the preProcess argument of train(). I am setting the decay to 0, 0.01, 0.1 and the size varying from 1 to 10.
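The PCA variant can be sketched as below (assuming `trainX`, `trainY`, and the LGOCV control object `ctrl` from the split step; `MaxNWts` is raised because of the 100 inputs):

```r
library(caret)

nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1), size = 1:10)

set.seed(1)
nnetPCA <- train(x = trainX, y = trainY,
                 method = "nnet",
                 # "pca" adds principal component signal extraction
                 preProc = c("center", "scale", "pca"),
                 tuneGrid = nnetGrid,
                 trControl = ctrl,
                 linout = TRUE, trace = FALSE, maxit = 500,
                 MaxNWts = 2000)
```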
## Neural Network
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100), principal component
## signal extraction (100)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 174, 174, 174, 174, 174, 174, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 12.18957 0.17306810 9.752068
## 0.00 2 12.07030 0.19394648 9.456260
## 0.00 3 12.33844 0.18108846 9.618613
## 0.00 4 17.23042 0.16965496 10.496844
## 0.00 5 14.49690 0.16442927 10.477867
## 0.00 6 15.86325 0.13896938 10.776789
## 0.00 7 15.32804 0.12527628 10.873554
## 0.00 8 26.41007 0.10204171 13.027818
## 0.00 9 26.87176 0.09186633 13.184149
## 0.00 10 28.45709 0.10217491 13.527934
## 0.01 1 12.08913 0.19165966 9.649027
## 0.01 2 12.26940 0.17826014 9.644300
## 0.01 3 12.43257 0.17869613 9.614212
## 0.01 4 12.56188 0.17500362 9.737005
## 0.01 5 12.69277 0.18150708 9.860975
## 0.01 6 13.01163 0.16285676 9.946475
## 0.01 7 13.24589 0.15658427 10.143004
## 0.01 8 13.50368 0.15374787 10.252936
## 0.01 9 14.06039 0.15797106 10.432378
## 0.01 10 14.61695 0.13208085 10.712957
## 0.10 1 12.26677 0.16607621 9.845121
## 0.10 2 12.00036 0.20539528 9.405782
## 0.10 3 12.15547 0.20013157 9.493548
## 0.10 4 12.39563 0.18840267 9.605633
## 0.10 5 12.69874 0.16371904 9.838733
## 0.10 6 12.74032 0.17380835 9.782252
## 0.10 7 13.17575 0.15112692 10.055735
## 0.10 8 13.10992 0.18480536 9.970003
## 0.10 9 13.90083 0.14257811 10.364962
## 0.10 10 13.54059 0.15092712 10.238430
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 2 and decay = 0.1.
## RMSE Rsquared MAE
## 9.5492911 0.2661236 7.6991560
From the above plot and the tabulated results, we can clearly see that the best neural network model has size 2 and decay 0.1. The R^2 value obtained on the test set is 0.2661, which is very low compared to the neural network without PCA. Thus PCA is not helping much.
Let's build a model-averaged neural network using train() in the caret package with avNNet as the method. I am setting the decay to 0, 0.01, 0.1 and the size varying from 1 to 10, with bag set to FALSE.
## Model Averaged Neural Network
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 174, 174, 174, 174, 174, 174, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 3.5242925 0.9328497 2.5045521
## 0.00 2 2.6571279 0.9583170 1.7362607
## 0.00 3 1.9201062 0.9738883 1.1502046
## 0.00 4 1.6349793 0.9841445 0.9848295
## 0.00 5 2.0133071 0.9703377 1.0801209
## 0.00 6 1.9676251 0.9688311 0.9854313
## 0.00 7 1.7264409 0.9798391 0.9438506
## 0.00 8 1.5669336 0.9851564 0.9096512
## 0.00 9 2.6359269 0.9427061 1.2948829
## 0.00 10 2.4005105 0.9571086 1.1634948
## 0.01 1 1.7042312 0.9835651 1.2983298
## 0.01 2 1.0029831 0.9942247 0.7364399
## 0.01 3 0.7946900 0.9963027 0.5879969
## 0.01 4 0.6934589 0.9972539 0.5039293
## 0.01 5 0.7170243 0.9970049 0.5162669
## 0.01 6 0.7232355 0.9970309 0.5259775
## 0.01 7 0.7502219 0.9966700 0.5417188
## 0.01 8 0.7136120 0.9970212 0.5240492
## 0.01 9 0.7782966 0.9964293 0.5604093
## 0.01 10 0.7869717 0.9964242 0.5787073
## 0.10 1 2.1284359 0.9743401 1.6929244
## 0.10 2 1.0607973 0.9935692 0.8107018
## 0.10 3 0.9055211 0.9954107 0.6909658
## 0.10 4 0.8275367 0.9961154 0.6323710
## 0.10 5 0.8514127 0.9957906 0.6385393
## 0.10 6 0.8927609 0.9953755 0.6765584
## 0.10 7 0.9224391 0.9950162 0.6956041
## 0.10 8 0.9477885 0.9946943 0.7065720
## 0.10 9 0.9932834 0.9941829 0.7347208
## 0.10 10 1.0517201 0.9934280 0.7704496
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag
## = FALSE.
## RMSE Rsquared MAE
## 0.3869129 0.9988571 0.3071895
From the above plot and the tabulated results, we can clearly see that the best averaged neural network model has a weight decay of 0.01 and size 4. The R^2 value obtained on the test set is 0.9989.
Let's build a MARS model with train() in the caret package with earth as the method. I am setting the degree to 1, 2, 3 and nprune varying from 2 to 38.
## Multivariate Adaptive Regression Spline
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 131, 131, 131, 131, 131, 131, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 10.882638 0.3213137 8.804301
## 1 3 9.605814 0.4753542 7.440073
## 1 4 7.538237 0.6794047 5.635368
## 1 5 4.944785 0.8646814 3.568613
## 1 6 4.191577 0.9032519 3.190915
## 1 7 3.504226 0.9339337 2.741468
## 1 8 3.555427 0.9307751 2.721168
## 1 9 3.330244 0.9399755 2.535762
## 1 10 3.314323 0.9398287 2.464012
## 1 11 3.092622 0.9469798 2.342433
## 1 12 3.062935 0.9482436 2.334861
## 1 13 3.046961 0.9490443 2.295845
## 1 14 3.022933 0.9491133 2.290815
## 1 15 3.101423 0.9462621 2.312860
## 1 16 3.046201 0.9482041 2.266685
## 1 17 3.016728 0.9491976 2.260351
## 1 18 3.022724 0.9491082 2.256020
## 1 19 3.025599 0.9491135 2.256206
## 1 20 3.023716 0.9491272 2.253802
## 1 21 3.033046 0.9488485 2.256340
## 1 22 3.033046 0.9488485 2.256340
## 1 23 3.033046 0.9488485 2.256340
## 1 24 3.033046 0.9488485 2.256340
## 1 25 3.033046 0.9488485 2.256340
## 1 26 3.033046 0.9488485 2.256340
## 1 27 3.033046 0.9488485 2.256340
## 1 28 3.033046 0.9488485 2.256340
## 1 29 3.033046 0.9488485 2.256340
## 1 30 3.033046 0.9488485 2.256340
## 1 31 3.033046 0.9488485 2.256340
## 1 32 3.033046 0.9488485 2.256340
## 1 33 3.033046 0.9488485 2.256340
## 1 34 3.033046 0.9488485 2.256340
## 1 35 3.033046 0.9488485 2.256340
## 1 36 3.033046 0.9488485 2.256340
## 1 37 3.033046 0.9488485 2.256340
## 1 38 3.033046 0.9488485 2.256340
## 2 2 10.882638 0.3213137 8.804301
## 2 3 9.655935 0.4691248 7.518094
## 2 4 7.535579 0.6790415 5.653381
## 2 5 4.843011 0.8715593 3.508081
## 2 6 4.342596 0.8984904 3.265936
## 2 7 3.528121 0.9328108 2.742578
## 2 8 3.460461 0.9364515 2.635235
## 2 9 3.218843 0.9458286 2.433121
## 2 10 3.241067 0.9439320 2.410856
## 2 11 3.134069 0.9446240 2.319778
## 2 12 3.011658 0.9492689 2.215820
## 2 13 2.987953 0.9492075 2.171037
## 2 14 3.229807 0.9443476 2.190484
## 2 15 3.096594 0.9494481 2.106684
## 2 16 3.039415 0.9515502 2.046034
## 2 17 2.942303 0.9545697 1.996892
## 2 18 2.873597 0.9564264 1.949334
## 2 19 2.851413 0.9571525 1.938927
## 2 20 2.785202 0.9586330 1.898244
## 2 21 2.760601 0.9592890 1.883520
## 2 22 2.745493 0.9597077 1.875384
## 2 23 2.746077 0.9597077 1.873338
## 2 24 2.730874 0.9600733 1.859675
## 2 25 2.732238 0.9600345 1.858422
## 2 26 2.732238 0.9600345 1.858422
## 2 27 2.732238 0.9600345 1.858422
## 2 28 2.732238 0.9600345 1.858422
## 2 29 2.732238 0.9600345 1.858422
## 2 30 2.732238 0.9600345 1.858422
## 2 31 2.732238 0.9600345 1.858422
## 2 32 2.732238 0.9600345 1.858422
## 2 33 2.732238 0.9600345 1.858422
## 2 34 2.732238 0.9600345 1.858422
## 2 35 2.732238 0.9600345 1.858422
## 2 36 2.732238 0.9600345 1.858422
## 2 37 2.732238 0.9600345 1.858422
## 2 38 2.732238 0.9600345 1.858422
## 3 2 10.882638 0.3213137 8.804301
## 3 3 9.702378 0.4635631 7.573560
## 3 4 7.550224 0.6760330 5.661790
## 3 5 4.861675 0.8703161 3.510451
## 3 6 4.259329 0.9014792 3.250413
## 3 7 3.531824 0.9327056 2.751938
## 3 8 3.451401 0.9363189 2.639899
## 3 9 3.312371 0.9414596 2.474992
## 3 10 3.173599 0.9457034 2.337047
## 3 11 3.120494 0.9444591 2.288823
## 3 12 3.018431 0.9486736 2.205814
## 3 13 2.917149 0.9515388 2.167526
## 3 14 3.191303 0.9460618 2.211166
## 3 15 3.088076 0.9496868 2.123090
## 3 16 3.041126 0.9511874 2.054220
## 3 17 2.916878 0.9552198 1.995997
## 3 18 2.925824 0.9549296 1.999696
## 3 19 2.919763 0.9549975 1.976350
## 3 20 2.835689 0.9571031 1.925702
## 3 21 2.847999 0.9567770 1.917946
## 3 22 2.827825 0.9573626 1.910699
## 3 23 2.827825 0.9573626 1.910699
## 3 24 2.829733 0.9573196 1.909537
## 3 25 2.829733 0.9573196 1.909537
## 3 26 2.829733 0.9573196 1.909537
## 3 27 2.829733 0.9573196 1.909537
## 3 28 2.829733 0.9573196 1.909537
## 3 29 2.829733 0.9573196 1.909537
## 3 30 2.829733 0.9573196 1.909537
## 3 31 2.829733 0.9573196 1.909537
## 3 32 2.829733 0.9573196 1.909537
## 3 33 2.829733 0.9573196 1.909537
## 3 34 2.829733 0.9573196 1.909537
## 3 35 2.829733 0.9573196 1.909537
## 3 36 2.829733 0.9573196 1.909537
## 3 37 2.829733 0.9573196 1.909537
## 3 38 2.829733 0.9573196 1.909537
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 24 and degree = 2.
## RMSE Rsquared MAE
## 1.3920623 0.9849597 1.1936344
From the above plot and the tabulated results, we can clearly see that the best MARS model has nprune = 24 and degree = 2. The R^2 value obtained on the test set is 0.985.
## RMSE Rsquared MAE
## 5.3556324 0.8156073 3.9938803
From the above plot and the tabulated results, the best MARS model with spatial sign preprocessing and correlated predictors removed has nprune = 13 and degree = 1. The R^2 value obtained on the test set is 0.8156.
Let's build a support vector machine using train() in the caret package with the svmRadial method, which uses a radial basis function kernel. I set tuneLength to 14, so the default grid search covers 14 cost values 2^-2, 2^-1, . . . , 2^11; sigma is estimated analytically by default.
## Support Vector Machines with Radial Basis Function Kernel
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 131, 131, 131, 131, 131, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 11.348009 0.4953144 8.682498
## 0.50 9.679274 0.5886628 7.503213
## 1.00 8.217206 0.6529025 6.299173
## 2.00 7.613297 0.6873605 5.769243
## 4.00 7.585359 0.6884891 5.718025
## 8.00 7.585357 0.6884891 5.718000
## 16.00 7.585357 0.6884891 5.718000
## 32.00 7.585357 0.6884891 5.718000
## 64.00 7.585357 0.6884891 5.718000
## 128.00 7.585357 0.6884891 5.718000
## 256.00 7.585357 0.6884891 5.718000
## 512.00 7.585357 0.6884891 5.718000
## 1024.00 7.585357 0.6884891 5.718000
## 2048.00 7.585357 0.6884891 5.718000
##
## Tuning parameter 'sigma' was held constant at a value of 0.06775675
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06775675 and C = 8.
## RMSE Rsquared MAE
## 6.004810 0.705378 4.375796
From the above plot and the tabulated result, we can clearly see that the final values used for the model were sigma = 0.06775675 and C = 8. The R^2 value obtained on the test set is 0.7054.
Thus, from building all the above models, I have the following table.
| Model | Parameter | Training RMSE | Training R Squared | Testing RMSE | Testing R Squared |
|---|---|---|---|---|---|
| KNN | k = 5 | 9.9076 | 0.4487 | 8.3322 | 0.4500 |
| Neural Network without PCA | size = 4 & decay = 0.01 | 0.8409 | 0.9954 | 0.4272 | 0.9987 |
| Neural Network with PCA | size = 2 & decay = 0.1 | 12.0004 | 0.2054 | 9.5493 | 0.2661 |
| Averaged Neural Network | size = 4 & decay = 0.01 | 0.6935 | 0.9973 | 0.3869 | 0.9989 |
| MARS with no preprocessing | nprune = 24 & degree = 2 | 2.7309 | 0.9601 | 1.3921 | 0.9850 |
| MARS with spatial sign and correlated predictors removed | nprune = 13 & degree = 1 | 6.1338 | 0.8042 | 5.3556 | 0.8156 |
| Support Vector Machine | C = 8 & sigma = 0.06775675 | 7.5854 | 0.6885 | 6.0048 | 0.7054 |
From the table we can see that the averaged neural network model gives the best predictions, as it has the highest test-set R^2 (0.9989). Coming back to the question of whether neural networks are sensitive to highly correlated predictors, the network without PCA (test R^2: 0.9987) clearly outperforms the one with PCA (test R^2: 0.2661).
Thus preprocessing with PCA did not help the neural network here.
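For reference, a minimal base-R sketch of what preProc = c("center", "scale", "pca") does before the network sees the data (the 95% variance threshold is caret's default; the toy matrix below is purely illustrative):

```r
# Center/scale each predictor, then keep the principal components that
# explain 95% of the variance; the model is trained on the scores.
set.seed(1)
X <- matrix(rnorm(100 * 10), ncol = 10)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.01)   # inject a highly correlated pair
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(var_explained >= 0.95)[1]  # components the model would keep
scores <- pca$x[, 1:n_comp, drop = FALSE]  # the inputs the network sees
```

Because the correlated pair collapses into a single component, fewer than 10 components are needed.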
We will first load the dataset in order to answer the questions that follow. To make sure the data loaded correctly, I checked its dimensions and printed a small subset.
## X1 X2 X3 X4 X5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## [1] 12.520 1.120 19.405 1.730 1.680
Thus we see that both the predictors and the target values are loaded properly. fingerprints contains the 1107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response variable. Exercise 6.2 notes that the fingerprint predictors are sparse. To handle this, we apply a near-zero variance filter from the caret package.
## [1] "Number of predictors left for modeling is 388"
Thus, after removing the near-zero variance predictors, we have 388 predictors left in the data. We will use these predictors to build the models.
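A minimal sketch of the rule nearZeroVar() applies with its default cutoffs (freqCut = 95/5, uniqueCut = 10); the helper name near_zero below is my own:

```r
# A predictor is flagged when the ratio of its most common value to its
# second most common exceeds 19 (95/5) AND at most 10% of its values are unique.
near_zero <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) tab[1] / tab[2] else Inf
  pct_unique <- 100 * length(tab) / length(x)
  freq_ratio > freq_cut && pct_unique <= unique_cut
}
near_zero(c(rep(0, 99), 1))   # sparse binary fingerprint column: flagged
near_zero(rnorm(100))         # continuous predictor: kept
```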
To split the data into a training and a test set, I put 80% of the data into the training set and the remaining samples into the test set. On the training set I use leave-group-out cross-validation (LGOCV), since the number of predictors is considerably larger than the number of samples. Let's build the models.
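A base-R sketch of this resampling scheme (plain random sampling here; createDataPartition() additionally stratifies on the outcome, which is why caret ends up with 133 training samples rather than the 132 below):

```r
# 80/20 split of the 165 compounds in the permeability data
set.seed(1)
n <- 165
train_idx <- sample(n, size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# LGOCV with caret defaults: 25 random holdouts of the TRAINING set,
# each fitting on 75% and validating on the remaining 25%
lgocv_splits <- lapply(1:25, function(r)
  sample(train_idx, size = floor(0.75 * length(train_idx))))
```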
Let's first fit a KNN model, using train() in the caret package with method knn. Below is the complete information about the KNN model.
## k-Nearest Neighbors
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 12.18047 0.4193401 8.328473
## 7 12.20318 0.4130314 8.443593
## 9 12.13573 0.4234838 8.385506
## 11 12.38515 0.3965554 8.666109
## 13 12.55781 0.3782623 8.902542
## 15 12.76116 0.3539242 9.259150
## 17 12.92986 0.3381347 9.560620
## 19 13.05106 0.3223688 9.796124
## 21 13.11881 0.3178781 9.944690
## 23 13.25921 0.3072111 10.108553
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
## RMSE Rsquared MAE
## 11.5584796 0.3785895 8.1628793
From the above plot and the tabulated results, we can see that the best model has k = 9; that is, the best performance is obtained when considering 9 nearest neighbours. The R^2 value obtained on the test set is 0.3786.
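As a sketch of what the chosen k = 9 model does at prediction time (the helper knn_predict below is illustrative; caret additionally centers and scales the predictors first):

```r
# Predict a new point by averaging the outcomes of its k nearest
# training points under Euclidean distance.
knn_predict <- function(X_train, y_train, x_new, k) {
  d <- sqrt(colSums((t(X_train) - x_new)^2))   # distance to each training row
  mean(y_train[order(d)[1:k]])                 # average of the k nearest outcomes
}
X <- matrix(c(0, 0, 1, 1, 10, 10), ncol = 2, byrow = TRUE)
y <- c(1, 3, 100)
knn_predict(X, y, c(0.5, 0.5), k = 2)   # averages the two nearby points: 2
```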
Let's build a neural network model using train() in the caret package with method nnet. I set the decay to 0, 0.01, and 0.1, with the size varying from 1 to 10.
## Neural Network
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 14.66894 0.1992654 11.007925
## 0.00 2 13.56363 0.3050131 9.964243
## 0.00 3 13.55893 0.3519219 9.954022
## 0.00 4 13.75327 0.3448457 10.082438
## 0.00 5 12.89814 0.4091014 9.677077
## 0.00 6 14.06830 0.3396351 10.564821
## 0.00 7 13.44795 0.3951566 9.961410
## 0.00 8 14.22428 0.3524452 10.497537
## 0.00 9 14.10839 0.3690021 10.373561
## 0.00 10 14.94362 0.3480160 10.998434
## 0.01 1 13.71719 0.3845546 9.113970
## 0.01 2 13.85347 0.3638802 10.007150
## 0.01 3 13.61979 0.3678061 10.103614
## 0.01 4 13.74590 0.3540830 10.212034
## 0.01 5 14.06074 0.3668699 10.442635
## 0.01 6 14.28716 0.3564290 10.629968
## 0.01 7 14.29278 0.3408776 10.756775
## 0.01 8 14.10389 0.3753960 10.443204
## 0.01 9 14.38443 0.3634419 10.617678
## 0.01 10 14.18870 0.3862505 10.663720
## 0.10 1 14.25657 0.3672503 9.443107
## 0.10 2 14.22994 0.3649989 10.397407
## 0.10 3 14.59384 0.3576527 10.731541
## 0.10 4 14.27325 0.3499417 10.435441
## 0.10 5 15.11034 0.3132665 11.138925
## 0.10 6 14.52704 0.3517566 10.853295
## 0.10 7 14.76204 0.3475786 10.971939
## 0.10 8 14.52536 0.3483075 10.905612
## 0.10 9 15.00350 0.3220190 11.066886
## 0.10 10 15.35468 0.3276643 11.333168
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5 and decay = 0.
## RMSE Rsquared MAE
## 14.2702625 0.2914559 11.1760107
From the above plot and the tabulated results, we can see that the best neural network has size 5 and decay 0. The R^2 value obtained on the test set is 0.2915.
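The MaxNWts bound used when training nnet comes from counting the weights of a single-hidden-layer network: with p inputs and s hidden units there are s*(p+1) hidden-layer weights (including biases) plus s+1 output weights. A quick check:

```r
# Weight count of a single-hidden-layer regression network
n_weights <- function(p, s) s * (p + 1) + s + 1
n_weights(p = 388, s = 10)   # 3901, matching MaxNWts = 10*(388+1) + 10 + 1
```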
Let's build a MARS model using train() in the caret package with method earth. I set the degree to 1, 2, and 3, with nprune varying from 2 to 38. I tried applying the spatial sign transformation as a preprocessing step, but it gave a worse prediction on the test set (R^2 of 0.23), so I trained the MARS model without it; the details are below.
## Multivariate Adaptive Regression Spline
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 12.42146 0.4099226 9.071569
## 1 3 12.49142 0.4018141 8.570705
## 1 4 12.83669 0.3769528 8.739432
## 1 5 13.12187 0.3594738 8.986555
## 1 6 13.15300 0.3661011 9.248789
## 1 7 13.55472 0.3617997 9.734635
## 1 8 13.83565 0.3647982 9.751219
## 1 9 13.85085 0.3657400 9.617669
## 1 10 14.08389 0.3579442 9.765826
## 1 11 14.23037 0.3506204 9.934069
## 1 12 14.67098 0.3320251 10.153010
## 1 13 15.07703 0.3243381 10.438316
## 1 14 15.11239 0.3224550 10.513071
## 1 15 15.46279 0.3116091 10.706421
## 1 16 15.75441 0.3118065 10.880115
## 1 17 15.95799 0.3136765 11.005202
## 1 18 16.19448 0.3090689 11.121403
## 1 19 16.39237 0.3083221 11.231873
## 1 20 16.39173 0.3095404 11.248110
## 1 21 16.35396 0.3119618 11.204453
## 1 22 16.36672 0.3114318 11.218035
## 1 23 16.36672 0.3114318 11.218035
## 1 24 16.42006 0.3102059 11.255753
## 1 25 16.42006 0.3102059 11.255753
## 1 26 16.42006 0.3102059 11.255753
## 1 27 16.42006 0.3102059 11.255753
## 1 28 16.42006 0.3102059 11.255753
## 1 29 16.42006 0.3102059 11.255753
## 1 30 16.42006 0.3102059 11.255753
## 1 31 16.42006 0.3102059 11.255753
## 1 32 16.42006 0.3102059 11.255753
## 1 33 16.42006 0.3102059 11.255753
## 1 34 16.42006 0.3102059 11.255753
## 1 35 16.42006 0.3102059 11.255753
## 1 36 16.42006 0.3102059 11.255753
## 1 37 16.42006 0.3102059 11.255753
## 1 38 16.42006 0.3102059 11.255753
## 2 2 12.73476 0.3810779 9.279446
## 2 3 13.23672 0.3374476 9.174074
## 2 4 13.88950 0.3161362 9.655913
## 2 5 14.00487 0.3202137 9.624312
## 2 6 14.41363 0.3175609 9.743502
## 2 7 14.32989 0.3219546 9.576293
## 2 8 14.27880 0.3291617 9.578573
## 2 9 15.63422 0.2778098 10.123078
## 2 10 16.00711 0.2706481 10.376437
## 2 11 16.37365 0.2610798 10.529966
## 2 12 17.00425 0.2471498 10.807932
## 2 13 16.98567 0.2475079 10.885248
## 2 14 17.39728 0.2412196 11.223152
## 2 15 17.60951 0.2391672 11.336776
## 2 16 17.85653 0.2353091 11.391346
## 2 17 17.80994 0.2411536 11.460800
## 2 18 18.09462 0.2282308 11.646935
## 2 19 18.22320 0.2216317 11.705319
## 2 20 18.33814 0.2176238 11.690920
## 2 21 18.19138 0.2250705 11.605744
## 2 22 18.16959 0.2255261 11.610091
## 2 23 18.22906 0.2250847 11.621930
## 2 24 18.31406 0.2249958 11.653615
## 2 25 18.31501 0.2248615 11.663328
## 2 26 18.38967 0.2215299 11.693625
## 2 27 18.39323 0.2215028 11.712044
## 2 28 18.39323 0.2215028 11.712044
## 2 29 18.39323 0.2215028 11.712044
## 2 30 18.39323 0.2215028 11.712044
## 2 31 18.39323 0.2215028 11.712044
## 2 32 18.39323 0.2215028 11.712044
## 2 33 18.39323 0.2215028 11.712044
## 2 34 18.39323 0.2215028 11.712044
## 2 35 18.39323 0.2215028 11.712044
## 2 36 18.39323 0.2215028 11.712044
## 2 37 18.39323 0.2215028 11.712044
## 2 38 18.39323 0.2215028 11.712044
## 3 2 12.68434 0.3873686 9.237108
## 3 3 13.45065 0.3229941 9.300490
## 3 4 14.29311 0.2862546 9.622268
## 3 5 14.56109 0.2848627 9.724570
## 3 6 14.70754 0.2805936 9.547797
## 3 7 14.97412 0.2909159 9.649438
## 3 8 15.54115 0.2690676 9.922469
## 3 9 15.87209 0.2611280 10.169324
## 3 10 15.92877 0.2628880 10.134016
## 3 11 16.23914 0.2496414 10.273357
## 3 12 16.42686 0.2496481 10.307985
## 3 13 16.65665 0.2472425 10.386725
## 3 14 16.68472 0.2525662 10.353012
## 3 15 16.91827 0.2468946 10.423964
## 3 16 16.87999 0.2553404 10.390833
## 3 17 16.98664 0.2587741 10.431291
## 3 18 17.18408 0.2539071 10.582685
## 3 19 17.22770 0.2580855 10.605322
## 3 20 17.45099 0.2503162 10.701284
## 3 21 17.44492 0.2461660 10.632780
## 3 22 17.57205 0.2450834 10.676506
## 3 23 17.50062 0.2486290 10.678435
## 3 24 17.47367 0.2497434 10.636471
## 3 25 17.47385 0.2495463 10.653408
## 3 26 17.47962 0.2495825 10.652147
## 3 27 17.50926 0.2480192 10.678255
## 3 28 17.50926 0.2480192 10.678255
## 3 29 17.55967 0.2454125 10.755039
## 3 30 17.55967 0.2454125 10.755039
## 3 31 17.55967 0.2454125 10.755039
## 3 32 17.55967 0.2454125 10.755039
## 3 33 17.55967 0.2454125 10.755039
## 3 34 17.55967 0.2454125 10.755039
## 3 35 17.55967 0.2454125 10.755039
## 3 36 17.55967 0.2454125 10.755039
## 3 37 17.55967 0.2454125 10.755039
## 3 38 17.55967 0.2454125 10.755039
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 2 and degree = 1.
## RMSE Rsquared MAE
## 9.8692942 0.5469159 7.4774994
From the above plot and the tabulated results, we can see that the best MARS model has nprune = 2 and degree = 1. The R^2 value obtained on the test set is 0.5469.
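As a sketch of the pieces a MARS model is built from: hinge functions h(x - c) = max(0, x - c), so nprune = 2 corresponds to an intercept plus a single hinge term (the knot at 3 and the coefficients below are illustrative):

```r
# The MARS basis function: zero below the knot c, linear above it
hinge <- function(x, c) pmax(0, x - c)
# A degree-1 MARS fit is a sum of such terms, e.g. f(x) = b0 + b1 * h(x - 3)
f <- function(x) 1 + 2 * hinge(x, 3)
f(c(2, 3, 5))   # flat at 1 below the knot, then rising with slope 2
```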
Let's build another MARS model using train() with method earth, with one small change: we first remove the highly correlated predictors and then train the model. Again I set the degree to 1, 2, and 3, with nprune varying from 2 to 38.
## Multivariate Adaptive Regression Spline
##
## 133 samples
## 83 predictor
##
## Pre-processing: centered (83), scaled (83)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 15.26772 0.1209577 11.79985
## 1 3 15.36755 0.1138346 11.56263
## 1 4 15.01790 0.1388334 11.24805
## 1 5 14.82656 0.1549963 11.02253
## 1 6 14.82341 0.1647039 11.09393
## 1 7 14.63252 0.1883691 10.98275
## 1 8 14.78057 0.1829358 11.07865
## 1 9 14.72215 0.1893190 11.08294
## 1 10 14.69136 0.1894610 11.07120
## 1 11 14.68626 0.1903254 11.02948
## 1 12 14.68626 0.1903254 11.02948
## 1 13 14.68626 0.1903254 11.02948
## 1 14 14.68626 0.1903254 11.02948
## 1 15 14.68626 0.1903254 11.02948
## 1 16 14.68626 0.1903254 11.02948
## 1 17 14.68626 0.1903254 11.02948
## 1 18 14.68626 0.1903254 11.02948
## 1 19 14.68626 0.1903254 11.02948
## 1 20 14.68626 0.1903254 11.02948
## 1 21 14.68626 0.1903254 11.02948
## 1 22 14.68626 0.1903254 11.02948
## 1 23 14.68626 0.1903254 11.02948
## 1 24 14.68626 0.1903254 11.02948
## 1 25 14.68626 0.1903254 11.02948
## 1 26 14.68626 0.1903254 11.02948
## 1 27 14.68626 0.1903254 11.02948
## 1 28 14.68626 0.1903254 11.02948
## 1 29 14.68626 0.1903254 11.02948
## 1 30 14.68626 0.1903254 11.02948
## 1 31 14.68626 0.1903254 11.02948
## 1 32 14.68626 0.1903254 11.02948
## 1 33 14.68626 0.1903254 11.02948
## 1 34 14.68626 0.1903254 11.02948
## 1 35 14.68626 0.1903254 11.02948
## 1 36 14.68626 0.1903254 11.02948
## 1 37 14.68626 0.1903254 11.02948
## 1 38 14.68626 0.1903254 11.02948
## 2 2 15.24019 0.1203556 11.76977
## 2 3 15.22822 0.1251379 11.52902
## 2 4 15.09519 0.1275473 11.30143
## 2 5 15.05492 0.1310002 11.22278
## 2 6 17.84249 0.1311944 11.73136
## 2 7 14.98824 0.1442769 11.14337
## 2 8 17.48682 0.1401701 11.64996
## 2 9 17.48682 0.1401701 11.64996
## 2 10 17.48682 0.1401701 11.64996
## 2 11 17.48682 0.1401701 11.64996
## 2 12 17.48682 0.1401701 11.64996
## 2 13 17.48682 0.1401701 11.64996
## 2 14 17.48682 0.1401701 11.64996
## 2 15 17.48682 0.1401701 11.64996
## 2 16 17.48682 0.1401701 11.64996
## 2 17 17.48682 0.1401701 11.64996
## 2 18 17.48682 0.1401701 11.64996
## 2 19 17.48682 0.1401701 11.64996
## 2 20 17.48682 0.1401701 11.64996
## 2 21 17.48682 0.1401701 11.64996
## 2 22 17.48682 0.1401701 11.64996
## 2 23 17.48682 0.1401701 11.64996
## 2 24 17.48682 0.1401701 11.64996
## 2 25 17.48682 0.1401701 11.64996
## 2 26 17.48682 0.1401701 11.64996
## 2 27 17.48682 0.1401701 11.64996
## 2 28 17.48682 0.1401701 11.64996
## 2 29 17.48682 0.1401701 11.64996
## 2 30 17.48682 0.1401701 11.64996
## 2 31 17.48682 0.1401701 11.64996
## 2 32 17.48682 0.1401701 11.64996
## 2 33 17.48682 0.1401701 11.64996
## 2 34 17.48682 0.1401701 11.64996
## 2 35 17.48682 0.1401701 11.64996
## 2 36 17.48682 0.1401701 11.64996
## 2 37 17.48682 0.1401701 11.64996
## 2 38 17.48682 0.1401701 11.64996
## 3 2 15.30468 0.1225605 11.80172
## 3 3 15.56329 0.1130472 11.79292
## 3 4 15.51354 0.1006252 11.63641
## 3 5 15.63962 0.1115818 11.69784
## 3 6 15.55068 0.1211300 11.59151
## 3 7 15.58878 0.1232556 11.59138
## 3 8 15.52450 0.1270505 11.57711
## 3 9 15.50542 0.1288293 11.56971
## 3 10 15.55495 0.1266223 11.58064
## 3 11 15.55495 0.1266223 11.58064
## 3 12 15.55495 0.1266223 11.58064
## 3 13 15.55495 0.1266223 11.58064
## 3 14 15.55495 0.1266223 11.58064
## 3 15 15.55495 0.1266223 11.58064
## 3 16 15.55495 0.1266223 11.58064
## 3 17 15.55495 0.1266223 11.58064
## 3 18 15.55495 0.1266223 11.58064
## 3 19 15.55495 0.1266223 11.58064
## 3 20 15.55495 0.1266223 11.58064
## 3 21 15.55495 0.1266223 11.58064
## 3 22 15.55495 0.1266223 11.58064
## 3 23 15.55495 0.1266223 11.58064
## 3 24 15.55495 0.1266223 11.58064
## 3 25 15.55495 0.1266223 11.58064
## 3 26 15.55495 0.1266223 11.58064
## 3 27 15.55495 0.1266223 11.58064
## 3 28 15.55495 0.1266223 11.58064
## 3 29 15.55495 0.1266223 11.58064
## 3 30 15.55495 0.1266223 11.58064
## 3 31 15.55495 0.1266223 11.58064
## 3 32 15.55495 0.1266223 11.58064
## 3 33 15.55495 0.1266223 11.58064
## 3 34 15.55495 0.1266223 11.58064
## 3 35 15.55495 0.1266223 11.58064
## 3 36 15.55495 0.1266223 11.58064
## 3 37 15.55495 0.1266223 11.58064
## 3 38 15.55495 0.1266223 11.58064
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 1.
## RMSE Rsquared MAE
## 14.1887564 0.1371695 11.0669611
From the above plot and the tabulated results, we can see that the best MARS model here has nprune = 7 and degree = 1. The R^2 value obtained on the test set is 0.1372.
Let's build the model using a support vector machine, via train() in the caret package with the svmRadial method, which uses a radial basis function kernel. I set tuneLength to 14, so the grid search covers the 14 default cost values 2^-2, 2^-1, . . . , 2^11; sigma is estimated analytically by default.
## Support Vector Machines with Radial Basis Function Kernel
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%)
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 16.66748 0.01417462 10.59983
## 0.50 16.09675 0.03861270 10.87586
## 1.00 15.62397 0.04326156 11.39807
## 2.00 15.47332 0.02000953 12.20375
## 4.00 15.47650 0.02009189 12.29866
## 8.00 15.47650 0.02009189 12.29866
## 16.00 15.47650 0.02009189 12.29866
## 32.00 15.47650 0.02009189 12.29866
## 64.00 15.47650 0.02009189 12.29866
## 128.00 15.47650 0.02009189 12.29866
## 256.00 15.47650 0.02009189 12.29866
## 512.00 15.47650 0.02009189 12.29866
## 1024.00 15.47650 0.02009189 12.29866
## 2048.00 15.47650 0.02009189 12.29866
##
## Tuning parameter 'sigma' was held constant at a value of 0.002329905
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.002329905 and C = 2.
## RMSE Rsquared MAE
## 11.7033030 0.3989128 9.4466225
From the above plot and the tabulated results, we can see that the final values used for the model were sigma = 0.00233 and C = 2. The R^2 value obtained on the test set is 0.3989.
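For reference, the held-out metrics reported by postResample() can be computed by hand in base R; note that caret's Rsquared is the squared correlation between observed and predicted values (the toy vectors below are illustrative):

```r
# Recomputing RMSE, Rsquared, and MAE on a toy pair of vectors
obs  <- c(3, 5, 7, 9)
pred <- c(2.8, 5.3, 6.9, 9.4)
rmse <- sqrt(mean((obs - pred)^2))
mae  <- mean(abs(obs - pred))
r2   <- cor(obs, pred)^2   # squared correlation, as caret reports
```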
Having built all of the above models, I obtain the following table.
| Model | Parameter | Training RMSE | Training R Squared | Testing RMSE | Testing R Squared |
|---|---|---|---|---|---|
| KNN | k as 9 | 12.1357 | 0.4234 | 11.5585 | 0.3786 |
| Neural Network | size as 5 & decay as 0 | 12.8981 | 0.4091 | 14.2703 | 0.2915 |
| MARS with no preprocessing | nprune as 2 & degree as 1 | 12.4215 | 0.4099 | 9.8693 | 0.5469 |
| MARS with correlated predictors removed | nprune as 7 & degree as 1 | 14.6325 | 0.1884 | 14.1888 | 0.1372 |
| Support Vector Machine | C as 2 & sigma as 0.00233 | 15.4733 | 0.0200 | 11.7033 | 0.3989 |
From the above table we can see that the MARS model performs best on both the resampling estimate and the test set, with comparatively high R^2 values: 0.4099 on training and 0.5469 on the test set.
Yes, the MARS model performs better than the optimal linear model, a partial least squares model with 7 components (test set R^2: 0.4457); the MARS model reaches a test set R^2 of 0.5469. This suggests that the underlying relationship between the predictors and the response is nonlinear.
No, I would not recommend the model, because even the best model (MARS) has a test set R^2 of only 0.5469, which is quite low.
*** End of Solution ***
# installing the packages
installNewPackage <- function(packageName) {
if(packageName %in% rownames(installed.packages()) == FALSE)
{
install.packages(packageName, repos = "http://cran.us.r-project.org", dependencies=TRUE)
}
}
installNewPackage("kernlab")
installNewPackage("mlbench")
installNewPackage("caret")
installNewPackage("AppliedPredictiveModeling")
library(kernlab)
library(mlbench)
library(caret)
library(AppliedPredictiveModeling)
# Code from the book
set.seed(1)
x <- runif(100, min = 2, max = 10)
y <- sin(x) + rnorm(length(x)) * .25
sinData <- data.frame(x = x, y = y)
plot(x, y)
## Create a grid of x values to use for prediction
dataGrid <- data.frame(x = seq(2, 10, length = 100))
head(dataGrid)
# Setting the expand grid for the cost and epsilon parameters just like the last Home work
svm_parameter_grid <- expand.grid(costs = 2^c(-2, 0, 2, 8), epsilons = c(.01, .05, .1, .5))
# Show the grid used
svm_parameter_grid
# Initializing the empty result data frame
result_df <- data.frame(actual = double(), pred = double(), costs = factor(), epsilons = factor())
# Set the screen split
par(mfrow = c(4, 4))
# Looping over the combination of SVM parameter grid
for(index in 1:nrow(svm_parameter_grid)) {
# Setting the seed
set.seed(1)
# Building the SVM
rbf_model <- ksvm(x = x, y = y, data = sinData, kernel ="rbfdot", kpar = "automatic", C = svm_parameter_grid$costs[index],
epsilon = svm_parameter_grid$epsilons[index])
# Make the prediction using the model generated
prediction <- predict(rbf_model, newdata = dataGrid)
plot(sinData$x, sinData$y, main = paste("Cost:", toString(svm_parameter_grid$costs[index]), "| Epsilon:", toString(svm_parameter_grid$epsilons[index])),
xlab = "x", ylab = "y", cex = 2)
points(dataGrid$x, prediction, type = "l", col = "red", lwd = 3)
# Create this iteration's result data frame
iter_results <- data.frame(actual = dataGrid$x, pred = prediction, costs = paste("Cost:", toString(svm_parameter_grid$costs[index])),
epsilons = paste("Epsilon:", toString(svm_parameter_grid$epsilons[index])))
result_df <- rbind(result_df, iter_results)
}
# Setting seed for expand grid
set.seed(1)
# Setting the expand grid for the cost and epsilon parameters just like the last Home work
svm_parameter_grid <- expand.grid(costs = 2^c(-2, 0, 2, 8), epsilons = c(.01, .05, .1, .5))
sigma_values <- data.frame(sigma = c(0.2043268, 0.9977490, 47.74), color = c("red", "blue", "green"))
# Initializing the empty result data frame
result_df <- data.frame(actual = double(), pred = double(), costs = factor(), epsilons = factor())
# Set the screen split
par(mfrow = c(4, 4))
# Looping over the combination of SVM parameter grid
for(index in 1:nrow(svm_parameter_grid)) {
plot(sinData$x, sinData$y, main = paste("Cost:", toString(svm_parameter_grid$costs[index]), "| Epsilon:", toString(svm_parameter_grid$epsilons[index])),
xlab = "x", ylab = "y", col = "black", lwd = 2)
for(i in 1:nrow(sigma_values)) {
# Setting the seed
set.seed(1)
# Building the SVM
rbf_model <- ksvm(x = x, y = y, data = sinData, kernel ="rbfdot", C = svm_parameter_grid$costs[index],
epsilon = svm_parameter_grid$epsilons[index], kpar = list(sigma = sigma_values$sigma[i]))
# Make the prediction using the model generated
prediction <- predict(rbf_model, newdata = dataGrid)
points(dataGrid$x, prediction, type = "l", col = sigma_values$color[i], lwd = 3)
# Create this iteration's result data frame
iter_results <- data.frame(actual = dataGrid$x, pred = prediction, costs = paste("Cost:", toString(svm_parameter_grid$costs[index])),
epsilons = paste("Epsilon:", toString(svm_parameter_grid$epsilons[index])))
result_df <- rbind(result_df, iter_results)
}
}
par(mai=c(0,0,0,0))
plot.new()
legend(x = "center", legend = sigma_values$sigma, col = sigma_values$color, lwd = 4, cex = 3, title = "Sigma Values", horiz = TRUE)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
# We convert the 'x' data from a matrix to a data frame
# One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
# Look at the data using
featurePlot(trainingData$x, trainingData$y)
# This creates a list with a vector 'y' and a matrix
# of predictors 'x'. Also simulate a large test set to
# estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
# Set the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the KNN model
if(file.exists("models/knn_model_q2.rds")) {
knn_model <- readRDS("models/knn_model_q2.rds")
} else {
knn_model <- train(x = trainingData$x, y = trainingData$y, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
saveRDS(knn_model, "models/knn_model_q2.rds")
}
# Print the model
knn_model
# Plot the model
plot(knn_model)
# Predict the model
knn_pred <- predict(knn_model, newdata = testData$x)
# Get the test Set performance metrics
postResample(pred = knn_pred, obs = testData$y)
# Create the grid for the network
nn_grid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10)
# Set the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/nnet_model_q2.rds")) {
nnet_model <- readRDS("models/nnet_model_q2.rds")
} else {
nnet_model <- train(x = trainingData$x, y = trainingData$y, tuneGrid = nn_grid, method = "nnet", preProc = c("center", "scale"),
linout = TRUE, trace = FALSE, MaxNWts = 10 * (ncol(trainingData$x)+1) + 10 + 1, maxit=500)
saveRDS(nnet_model, "models/nnet_model_q2.rds")
}
# Print the model
nnet_model
# Plot the model
plot(nnet_model)
# Predict the test set
nnet_pred <- predict(nnet_model, newdata = testData$x)
# Get the test Set performance metrics
postResample(pred = nnet_pred, obs = testData$y)
# Create the tune grid
tune_grid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)
# Setting the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/avg_nnet_model_q2.rds")) {
avg_nnet_model <- readRDS("models/avg_nnet_model_q2.rds")
} else {
avg_nnet_model <- train(x = trainingData$x, y = trainingData$y, tuneGrid = tune_grid, method = "avNNet", preProc = c("center", "scale"),
linout = TRUE, trace = FALSE, MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1, maxit = 500)
saveRDS(avg_nnet_model, "models/avg_nnet_model_q2.rds")
}
# Print the model
avg_nnet_model
# Plot the model
plot(avg_nnet_model)
# Make the prediction
avg_nnet_pred <- predict(avg_nnet_model, newdata = testData$x)
# Get the performance scores
postResample(pred = avg_nnet_pred, obs = testData$y)
# Create the tune grid
tune_grid <- expand.grid(.degree = 1:3, .nprune = 2:38)
# Setting the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/mars_model_q2.rds")) {
mars_model <- readRDS("models/mars_model_q2.rds")
} else {
mars_model <- train(x = trainingData$x, y = trainingData$y, tuneGrid = tune_grid, method = "earth")
saveRDS(mars_model, "models/mars_model_q2.rds")
}
# Print the model
mars_model
# Plot the model
plot(mars_model)
# Make the prediction
mars_pred <- predict(mars_model, newdata = testData$x)
# Get the performance scores
postResample(pred = mars_pred, obs = testData$y)
# Setting the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/svm_model_q2.rds")) {
svm_model <- readRDS("models/svm_model_q2.rds")
} else {
svm_model <- train(x = trainingData$x, y = trainingData$y, tuneLength = 14, method = "svmRadial", preProc = c("center","scale"))
saveRDS(svm_model, "models/svm_model_q2.rds")
}
# Print the model
svm_model
# Plot the model
plot(svm_model)
# Make the prediction
svm_pred <- predict(svm_model, newdata = testData$x)
# Get the performance scores
postResample(pred = svm_pred, obs = testData$y)
mars_mode_imp <- varImp(mars_model)
plot(mars_mode_imp, top = 10)
# Load the data
data(tecator)
absorp_df <- as.data.frame(absorp)
endpoints_df <- as.data.frame(endpoints)
colnames(endpoints_df) = c("moisture", "fat", "protein")
# Verify the data is loaded or not
head(absorp_df[1:5])
head(endpoints_df)
# Setting the seed for reproducibility
set.seed(1)
# Performing data splitting
cv_index <- createDataPartition(endpoints[, 2], p = 0.8, list = FALSE)
absorpTrain <- absorp_df[cv_index, ]
absorpTest <- absorp_df[-cv_index, ]
yTrain <- endpoints_df[cv_index, 2]
yTest <- endpoints_df[-cv_index, 2]
# Setting up the control parameter
ctrl <- trainControl(method = "LGOCV")
# Set the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the KNN model
if(file.exists("models/knn_model_q3.rds")) {
knn_model <- readRDS("models/knn_model_q3.rds")
} else {
knn_model <- train(x = absorpTrain, y = yTrain, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
saveRDS(knn_model, "models/knn_model_q3.rds")
}
# Print the model
knn_model
# Plot the model
plot(knn_model)
# Predict the model
knn_pred <- predict(knn_model, newdata = absorpTest)
# Get the test Set performance metrics
postResample(pred = knn_pred, obs = yTest)
# Create the grid for the network
nn_grid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10)
# Set the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/nnet_model_q3.rds")) {
nnet_model <- readRDS("models/nnet_model_q3.rds")
} else {
nnet_model <- train(x = absorpTrain, y = yTrain, tuneGrid = nn_grid, method = "nnet", preProc = c("center", "scale"),
linout = TRUE, trace = FALSE, MaxNWts = 10 * (ncol(absorpTrain)+1) + 10 + 1, maxit=500)
saveRDS(nnet_model, "models/nnet_model_q3.rds")
}
# Print the model
nnet_model
# Plot the model
plot(nnet_model)
# Predict the test set
nnet_pred <- predict(nnet_model, newdata = absorpTest)
# Get the test Set performance metrics
postResample(pred = nnet_pred, obs = yTest)
# Create the grid for the network
nn_grid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10)
# Set the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/nnet_pca_model_q3.rds")) {
nnet_pca_model <- readRDS("models/nnet_pca_model_q3.rds")
} else {
nnet_pca_model <- train(x = absorpTrain, y = yTrain, tuneGrid = nn_grid, method = "nnet", preProc = c("center", "scale", "pca"),
linout = TRUE, trace = FALSE, MaxNWts = 10 * (ncol(absorpTrain)+1) + 10 + 1, maxit=500)
saveRDS(nnet_pca_model, "models/nnet_pca_model_q3.rds")
}
# Print the model
nnet_pca_model
# Plot the model
plot(nnet_pca_model)
# Predict the test set
nnet_pca_pred <- predict(nnet_pca_model, newdata = absorpTest)
# Get the test Set performance metrics
postResample(pred = nnet_pca_pred, obs = yTest)
# Create the tune grid
tune_grid <- expand.grid(.decay = c(0, 0.01, .1), .size = 1:10, .bag = FALSE)
# Setting the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/avg_nnet_model_q3.rds")) {
avg_nnet_model <- readRDS("models/avg_nnet_model_q3.rds")
} else {
avg_nnet_model <- train(x = absorpTrain, y = yTrain, tuneGrid = tune_grid, method = "avNNet", preProc = c("center", "scale"),
linout = TRUE, trace = FALSE, MaxNWts = 10 * (ncol(absorpTrain) + 1) + 10 + 1, maxit = 500)
saveRDS(avg_nnet_model, "models/avg_nnet_model_q3.rds")
}
# Print the model
avg_nnet_model
# Plot the model
plot(avg_nnet_model)
# Make the prediction
avg_nnet_pred <- predict(avg_nnet_model, newdata = absorpTest)
# Get the performance scores
postResample(pred = avg_nnet_pred, obs = yTest)
# Create the tune grid
tune_grid <- expand.grid(.degree = 1:3, .nprune = 2:38)
# Setting the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/mars_model_q3.rds")) {
mars_model <- readRDS("models/mars_model_q3.rds")
} else {
mars_model <- train(x = absorpTrain, y = yTrain, trControl = ctrl, tuneGrid = tune_grid, method = "earth")
saveRDS(mars_model, "models/mars_model_q3.rds")
}
# Print the model
mars_model
# Plot the model
plot(mars_model)
# Make the prediction
mars_pred <- predict(mars_model, newdata = absorpTest)
# Get the performance scores
postResample(pred = mars_pred, obs = yTest)
# Setting the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the model
if(file.exists("models/svm_model_q3.rds")) {
svm_model <- readRDS("models/svm_model_q3.rds")
} else {
svm_model <- train(x = absorpTrain, y = yTrain, trControl = ctrl, tuneLength = 14, method = "svmRadial", preProc = c("center","scale"))
saveRDS(svm_model, "models/svm_model_q3.rds")
}
# Print the model
svm_model
# Plot the model
plot(svm_model)
# Make the prediction
svm_pred <- predict(svm_model, newdata = absorpTest)
# Get the performance scores
postResample(pred = svm_pred, obs = yTest)
data(permeability)
fingerprints_df <- as.data.frame(fingerprints)
head(fingerprints_df[, 1:5])
permeability[1:5]
fingerprints_filtered_df <- fingerprints_df[, -nearZeroVar(fingerprints_df)]
print(paste("Number of predictors left for modeling is", dim(fingerprints_filtered_df)[2]))
# Setting the seed for reproducibility
set.seed(1)
# Performing data splitting
cv_index <- createDataPartition(permeability, p = 0.8, list = FALSE)
fingerprintsTrain <- fingerprints_filtered_df[cv_index,]
fingerprintsTest <- fingerprints_filtered_df[-cv_index,]
permeabilityTrain <- permeability[cv_index]
permeabilityTest <- permeability[-cv_index]
# Setting up the control parameter
ctrl <- trainControl(method = "LGOCV")
# Set the seed
set.seed(1)
# Check the file exists and load to variables
# else build and store the KNN model
if(file.exists("models/knn_model_q4.rds")) {
knn_model <- readRDS("models/knn_model_q4.rds")
} else {
knn_model <- train(x = fingerprintsTrain, y = permeabilityTrain, method = "knn", preProc = c("center", "scale"), trControl = ctrl, tuneLength = 10)
saveRDS(knn_model, "models/knn_model_q4.rds")
}
# Print the model
knn_model
# Plot the model
plot(knn_model)
# Predict on the test set
knn_pred <- predict(knn_model, newdata = fingerprintsTest)
# Get the test set performance metrics
postResample(pred = knn_pred, obs = permeabilityTest)
# Create the grid for the network
nn_grid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10)
# Set the seed
set.seed(1)
# If a saved model file exists, load it;
# else build and store the model
if (file.exists("models/nnet_model_q4.rds")) {
nnet_model <- readRDS("models/nnet_model_q4.rds")
} else {
nnet_model <- train(x = fingerprintsTrain, y = permeabilityTrain, tuneGrid = nn_grid, method = "nnet", preProc = c("center", "scale"), trControl = ctrl,
                    # MaxNWts must accommodate the largest network in the grid (size 10), including bias terms
                    linout = TRUE, trace = FALSE, MaxNWts = 10 * (ncol(fingerprintsTrain) + 1) + 10 + 1, maxit = 500)
saveRDS(nnet_model, "models/nnet_model_q4.rds")
}
# Print the model
nnet_model
# Plot the model
plot(nnet_model)
# Predict the test set
nnet_pred <- predict(nnet_model, newdata = fingerprintsTest)
# Get the test set performance metrics
postResample(pred = nnet_pred, obs = permeabilityTest)
# Create the tune grid
tune_grid <- expand.grid(.degree = 1:3, .nprune = 2:38)
# Setting the seed
set.seed(1)
# If a saved model file exists, load it;
# else build and store the model
if (file.exists("models/mars_model_q4.rds")) {
mars_model <- readRDS("models/mars_model_q4.rds")
} else {
mars_model <- train(x = fingerprintsTrain, y = permeabilityTrain, trControl = ctrl, tuneGrid = tune_grid, method = "earth")
saveRDS(mars_model, "models/mars_model_q4.rds")
}
# Print the model
mars_model
# Plot the model
plot(mars_model)
# Make the prediction
mars_pred <- predict(mars_model, newdata = fingerprintsTest)
# Get the performance scores
postResample(pred = mars_pred, obs = permeabilityTest)
# Remove the highly correlated predictors
# (findCorrelation expects a correlation matrix, not the raw data)
highlyCorDescr <- findCorrelation(cor(fingerprints_filtered_df), cutoff = .75)
fingerprints_filtered_cor_df <- fingerprints_filtered_df[,-highlyCorDescr]
# Split the data
fingerprintsCorTrain <- fingerprints_filtered_cor_df[cv_index,]
fingerprintsCorTest <- fingerprints_filtered_cor_df[-cv_index,]
# Create the tune grid
tune_grid <- expand.grid(.degree = 1:3, .nprune = 2:38)
# Setting the seed
set.seed(1)
# If a saved model file exists, load it;
# else build and store the model
if (file.exists("models/mars_model_cor_q4.rds")) {
mars_model <- readRDS("models/mars_model_cor_q4.rds")
} else {
mars_model <- train(x = fingerprintsCorTrain, y = permeabilityTrain, trControl = ctrl, preProcess = "spatialSign", tuneGrid = tune_grid, method = "earth")
saveRDS(mars_model, "models/mars_model_cor_q4.rds")
}
# Print the model
mars_model
# Plot the model
plot(mars_model)
# Make the prediction
mars_pred <- predict(mars_model, newdata = fingerprintsCorTest)
# Get the performance scores
postResample(pred = mars_pred, obs = permeabilityTest)
# Setting the seed
set.seed(1)
# If a saved model file exists, load it;
# else build and store the model
if (file.exists("models/svm_model_q4.rds")) {
svm_model <- readRDS("models/svm_model_q4.rds")
} else {
svm_model <- train(x = fingerprintsTrain, y = permeabilityTrain, tuneLength = 14, method = "svmRadial",
trControl = ctrl, preProc = c("center","scale"))
saveRDS(svm_model, "models/svm_model_q4.rds")
}
# Print the model
svm_model
# Plot the model
plot(svm_model)
# Make the prediction
svm_pred <- predict(svm_model, newdata = fingerprintsTest)
# Get the performance scores
postResample(pred = svm_pred, obs = permeabilityTest)